RL in Classic Games

Optimality in Games

Optimal strategies in games are usually defined as the strategies that maximize the expected return. In games with multiple players, the optimal strategy depends on the strategies of the other players.

If the opponents follow fixed strategies, the optimal strategy can be found: we can treat the opponents as part of the environment and solve the game as a single-agent problem. For example, a Tic-Tac-Toe opponent that follows a fixed strategy simply becomes part of the environment's dynamics.

Nash Equilibrium: a joint policy from which no player can improve their expected return by unilaterally changing their own strategy.

Let the policy of player $i$ be $\pi^i$, and write $\pi^{-i}$ for the joint policy of all other players.

If all other players fix their strategies to $\pi^{-i}$, the best response for player $i$ is

$\pi^i = \pi_*^i(\pi^{-i})$

A Nash equilibrium is a joint policy in which every player simultaneously plays a best response: $\pi^i = \pi_*^i(\pi^{-i})$ for all $i$.

An example is the Rock-Paper-Scissors game. In its Nash equilibrium, both players play the uniform distribution over the three actions: against a uniform opponent, no action does better than any other, so neither player can gain by deviating. If instead the opponent plays a fixed strategy, say always rock, the best response is to always play paper; the rock player then always loses, so this joint policy is not a Nash equilibrium.
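
To make this concrete, here is a small sketch (in Python, not part of the original notes) that computes the expected payoff of each action against a fixed opponent strategy in Rock-Paper-Scissors:

```python
# Row player's payoffs in Rock-Paper-Scissors (zero-sum).
# Action order: Rock, Paper, Scissors; A[i][j] = our payoff playing i vs. opponent's j.
A = [[ 0, -1,  1],
     [ 1,  0, -1],
     [-1,  1,  0]]

def expected_payoffs(opponent):
    """Expected payoff of each of our pure actions vs. a mixed opponent strategy."""
    return [sum(A[i][j] * opponent[j] for j in range(3)) for i in range(3)]

# Against the uniform strategy every action earns 0: no deviation helps,
# so uniform-vs-uniform is the Nash equilibrium.
print(expected_payoffs([1/3, 1/3, 1/3]))   # [0.0, 0.0, 0.0]

# Against "always rock", Paper (index 1) is strictly best: the fixed
# strategy is exploitable, so this joint policy is not an equilibrium.
print(expected_payoffs([1.0, 0.0, 0.0]))   # [0.0, 1.0, -1.0]
```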

A Nash equilibrium is the best an agent can do assuming the other players do not change their strategies.

Computing a best response is itself a single-agent RL problem: with the opponents fixed, they become part of the environment.

A Nash equilibrium can be approached by repeatedly solving such single-agent best-response problems. In particular, the agent can learn by playing games against copies of itself; this is called self-play.

Learning from self-played games is a common approach to finding strong strategies. Each agent learns a best response to the other players; one player's policy determines another player's environment, so all players are adapting to each other.
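
A tiny sketch of what "all players adapting to each other" can look like: if both Rock-Paper-Scissors players naively switch to the best response against the opponent's last action, joint play cycles forever instead of settling. The action encoding below is an assumption for illustration:

```python
# Pure best-response dynamics in Rock-Paper-Scissors (a sketch).
# 0 = Rock, 1 = Paper, 2 = Scissors; the best response to action a is (a + 1) % 3.
def best_response(a):
    return (a + 1) % 3

p1, p2 = 0, 0   # both players start with Rock
history = []
for _ in range(6):
    # each player responds to the other's previous action
    p1, p2 = best_response(p2), best_response(p1)
    history.append((p1, p2))

# Joint play cycles Rock -> Paper -> Scissors forever: naive best-response
# dynamics need not converge to the (mixed) Nash equilibrium.
print(history)  # [(1, 1), (2, 2), (0, 0), (1, 1), (2, 2), (0, 0)]
```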

Is the Nash Equilibrium unique? In general, no: a game can have several Nash equilibria. In two-player zero-sum games, however, every equilibrium yields the same (minimax) value.

Two-Player Zero-Sum Games

$R^1 + R^2 = 0$

The reward of one player is the negative of the reward of the other: the only way for one player to do better is for the other to do worse. The two players are directly opposed; one player's gain is the other's loss.

Game Theory

Game theory is the study of strategic interactions between rational agents.

There are two main ways of finding Nash equilibria in two-player zero-sum games: game-tree search (such as minimax, below) and self-play reinforcement learning.

A perfect information game is a game where all players observe the full state of the game at all times (e.g. chess).

An imperfect information game is a game where players do not observe the full state at all times (e.g. poker, where opponents' cards are hidden).

MiniMax

Value function defining the expected total reward given joint policies $\pi = \langle \pi^1, \pi^2 \rangle$:

$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

A minimax value function maximizes white's expected return while minimizing black's expected return.

$v_*(s) = \max_{\pi^1} \min_{\pi^2} v_\pi(s)$

A minimax policy is a Nash equilibrium.

MiniMax Search Example

The minimax algorithm is used in decision-making and game theory to minimize the possible loss for a worst-case scenario. When dealing with two players, one aims to maximize their score (max player), and the other aims to minimize the score (min player).

Tree Structure

The tree alternates between max and min levels: the root is a max node, its children are min nodes, and so on down to the leaves.

Steps to Evaluate the Tree

  1. Leaf Nodes: Start by evaluating the leaf nodes. These are the final payoffs at the end of the game.

  2. Max Level (Third Level):

    • Nodes under $b_1$:
      • $+7$ and $+3$ lead to choosing the maximum, which is $+7$.
      • $-4$ and $-2$ lead to choosing the maximum, which is $-2$.
    • Nodes under $b_2$:
      • $+5$ and $+9$ lead to choosing the maximum, which is $+9$.
      • $-6$ and $-4$ lead to choosing the maximum, which is $-4$.
  3. Min Level (Second Level):

    • Nodes under $a_1$:
      • $+7$ and $-2$ lead to choosing the minimum, which is $-2$.
    • Nodes under $a_2$:
      • $+9$ and $-4$ lead to choosing the minimum, which is $-4$.
  4. Max Level (Top Level):

    • At the root, choosing between $a_1$ and $a_2$:
      • $-2$ and $-4$ lead to choosing the maximum, which is $-2$, so the root plays $a_1$.
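
The steps above can be sketched as a short recursive function; the nested-list encoding of the example tree is an assumption for illustration:

```python
# Minimax over a game tree given as nested lists, with numbers as leaf payoffs.
def minimax(node, maximizing):
    if isinstance(node, (int, float)):
        return node                      # leaf: return its payoff
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# The example tree: root (max) -> a1, a2 (min) -> b-nodes (max) -> leaves.
tree = [[[7, 3], [-4, -2]],   # a1: leaves under its two b-nodes
        [[5, 9], [-6, -4]]]   # a2

print(minimax(tree, True))  # -2, matching the hand evaluation above
```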

The main problem is the exponential growth of the tree: the number of leaf nodes is $b^d$, where $b$ is the branching factor and $d$ is the depth of the tree.

Instead of searching to the end of the game, use a value function to estimate the minimax value at the leaf nodes, and run minimax search to a fixed depth with respect to those leaf values.
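
A minimal sketch of depth-limited search: below a fixed depth the recursion stops and a value estimate stands in for the true minimax value. The heuristic `v_hat` here (averaging the leaves beneath a node) is a made-up stand-in for a learned value function:

```python
# Depth-limited minimax with a value-function estimate at the search horizon.
def leaves(node):
    if isinstance(node, (int, float)):
        return [node]
    return [x for child in node for x in leaves(child)]

def v_hat(node):
    """Hypothetical value estimate: the mean of the leaves below a node."""
    ls = leaves(node)
    return sum(ls) / len(ls)

def minimax(node, depth, maximizing):
    if isinstance(node, (int, float)):
        return node                      # true terminal payoff
    if depth == 0:
        return v_hat(node)               # truncate search with the estimate
    values = [minimax(child, depth - 1, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

tree = [[[7, 3], [-4, -2]], [[5, 9], [-6, -4]]]
print(minimax(tree, 2, True))  # -3.0: the 2-ply estimate
print(minimax(tree, 3, True))  # -2: full-depth search recovers the exact value
```

Note that the truncated search returns a different (approximate) value than the full search; the quality of depth-limited minimax depends entirely on how good the leaf-value estimates are.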

Self-Play Temporal-Difference Learning

The idea is to apply basic RL methods to games via self-play: the agent plays both sides of the game, generating its own experience, and uses it to learn a value function.

Simple TD: update the value function toward the value of the successor state, $v(S_t) \leftarrow v(S_t) + \alpha\,(v(S_{t+1}) - v(S_t))$, then act greedily with respect to the learned values.
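
A minimal sketch of simple TD learning by self-play on Tic-Tac-Toe (the hyperparameters, epsilon-greedy move selection, and state encoding below are assumptions for illustration, not taken from the notes):

```python
import random

# Tabular TD(0) self-play for Tic-Tac-Toe (a sketch). The state is a tuple
# of 9 cells in {0, 1, 2}; values are stored from player 1's perspective.
V = {}                 # state -> estimated value for player 1
ALPHA, EPS = 0.2, 0.1  # step size and exploration rate (assumed values)

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def terminal_value(s):
    """+1 / -1 / 0 if the game is over, else None."""
    for a, b, c in LINES:
        if s[a] != 0 and s[a] == s[b] == s[c]:
            return 1.0 if s[a] == 1 else -1.0
    return 0.0 if 0 not in s else None

def value(s):
    tv = terminal_value(s)
    return tv if tv is not None else V.get(s, 0.0)

def play_episode():
    s, player = (0,) * 9, 1
    while terminal_value(s) is None:
        moves = [i for i in range(9) if s[i] == 0]
        def after(m, s=s, player=player):
            t = list(s); t[m] = player
            return tuple(t)
        if random.random() < EPS:
            m = random.choice(moves)
        else:  # player 1 maximizes the shared value, player 2 minimizes it
            pick = max if player == 1 else min
            m = pick(moves, key=lambda m: value(after(m)))
        s2 = after(m)
        # TD(0) backup: move v(s) toward the value of the successor state
        V[s] = value(s) + ALPHA * (value(s2) - value(s))
        s, player = s2, 3 - player

random.seed(0)
for _ in range(5000):
    play_episode()
# Under good play Tic-Tac-Toe is a draw, so the learned value of the empty
# board should drift toward 0 (no convergence guarantee for this sketch).
print(round(value((0,) * 9), 2))
```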

TD Root: update the value of the current state toward the minimax search value computed at the next state.

TD Leaf: update the value at the leaf of the current search toward the minimax search value at the next step (the approach used by the chess program KnightCap).

TreeStrap: update the values of all nodes in the search tree toward the deeper minimax search values from the same search.


#MMI706 - Reinforcement Learning at METU